Imagine we’re some fancy data scientists exploring - once again - the gapminder data. We’re particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS recommended using this tidyverse thing to complete our tasks. At the same time, they are hesitant to load all of its R-packages at once.

1

Load all packages from the tidyverse for importing Excel data and for data wrangling.
You can find them in the slides from A3 and A5.
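
Which packages the slides list is not repeated here. As a minimal sketch, assuming that readxl handles the Excel import and that dplyr and tidyr cover the data wrangling, loading the packages one by one could look like this:

```r
# Assumption: these three packages are the ones meant by the task.
library(readxl)  # importing Excel files
library(dplyr)   # data wrangling: filter(), select(), slice(), arrange(), ...
library(tidyr)   # reshaping between wide and long format
```

Loading single packages instead of library(tidyverse) attaches only what is actually needed.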

Ok, that wasn’t too hard. But data science is about data, so we have to load in the data.

2

Import the GDP gapminder data. Make sure to only import the excel sheet named “Data”.
Individual sheets can be chosen by applying the option sheet = "name_of_your_sheet"
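
To illustrate the sheet option without the gapminder file, here is a small sketch that uses an example workbook shipped with readxl. The file and the sheet name "mtcars" belong to that example workbook, not to the exercise data:

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the gapminder file).
example_path <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook.
excel_sheets(example_path)

# Import only the sheet called "mtcars".
mtcars_sheet <- read_excel(example_path, sheet = "mtcars")
dim(mtcars_sheet)
```

For the exercise, the path would point to the downloaded gapminder file and the sheet name would be "Data".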

Have the data been successfully imported? They should comprise a tibble of 275 x 53. Furthermore, the income per person for Algeria in the years 1960, 1961, and 1962 should be 1280, 1085, and 856.

3

Prove that the income per person for Algeria in the years 1960, 1961, and 1962 is 1280, 1085, and 856.
Algeria is in the 5th row of the dataset and the relevant variables are in the first four columns. You can also subset datasets by number: select() picks columns by position and slice() picks rows by position.
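
The difference between the two functions can be sketched on a toy tibble. The data below are made up for illustration and are not gapminder values:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data in wide format.
toy <- tibble(
  country = c("A", "B", "C"),
  `2000`  = c(10, 20, 30),
  `2001`  = c(11, 21, 31),
  `2002`  = c(12, 22, 32)
)

second_row <- toy %>%
  slice(2) %>%    # keep row number 2
  select(1:3)     # keep columns number 1 to 3

second_row
```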

Let’s say we’re interested in the earliest 10 years of development in all countries and in the most recent 10 years. The idea is that there might be some differences between the early days and the new days of GDP development. At first, we’d like to compute such statistics across all countries. Unfortunately, the data are in the wide format.

4

Re-arrange the data such that they are in the long format.
Remember that the command for converting wide format data to long format is gather(). (In current versions of tidyr, gather() is superseded by pivot_longer(), but it still works.) Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename(), as it’s really messy.
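
A minimal sketch of the reshaping step on made-up data. The column names country, year, and gdp are assumptions chosen for illustration, and convert = TRUE is used here so that the year column ends up numeric instead of character:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data in wide format with a messy first column name.
toy_wide <- tibble(
  `A very long column name` = c("A", "B"),
  `2000` = c(10, NA),
  `2001` = c(11, 21)
)

toy_long <- toy_wide %>%
  rename(country = `A very long column name`) %>%   # new_name = old_name
  gather(key = "year", value = "gdp", -country,     # gather all columns except country
         convert = TRUE)                            # turn "2000", "2001" into numbers

toy_long
```

Each country-year combination now has its own row, including the one with a missing value.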

Ok, did it work out? There are still a lot of missing values we might get rid of, and the data are not arranged in a proper way. The missing values make the data untidy, distract us, and are not part of any mean calculations anyway. For the upcoming tasks, simply re-use your code and add the next commands with the pipe operator %>%.

5

Remove all missing values and arrange the data by year and GDP in ascending order.
There are several ways to exclude missing values. The most convenient one is to use filter() in combination with !is.na().
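
As a sketch on made-up long format data (the column names year and gdp are assumptions), the two steps can be chained like this:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data in long format with one missing value.
toy_long <- tibble(
  country = c("A", "B", "C", "A"),
  year    = c(2001, 2000, 2000, 2000),
  gdp     = c(11, NA, 30, 10)
)

toy_clean <- toy_long %>%
  filter(!is.na(gdp)) %>%   # drop rows where gdp is missing
  arrange(year, gdp)        # sort by year first, then by gdp within each year

toy_clean
```

arrange() sorts in ascending order by default; wrapping a variable in desc() would reverse that.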

Nice. Now we have a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we still have the values for all years between these two time periods in the data. But we decided to leave them there for some future analyses. As such, we do all analyses on the fly. Let’s start with the first time period.

6

Calculate the mean value of GDP across all countries for each of the first ten years.
If the year variable is numeric (a double), you can simply filter the range of years you are interested in.
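
One way to compute a mean per year is to filter the range, group by year, and summarise. The sketch below uses made-up data with a two-year window instead of ten years; the column names and the name mean_gdp are assumptions:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: two countries, four years.
toy_long <- tibble(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2000:2003, times = 2),
  gdp     = c(10, 12, 14, 16, 20, 22, 24, 26)
)

early_means <- toy_long %>%
  filter(year >= 2000, year <= 2001) %>%   # keep only the early window
  group_by(year) %>%                       # one group per year
  summarise(mean_gdp = mean(gdp))          # mean across countries within each year

early_means
```

Because the filtering happens inside the pipe, the original data keep all years for later analyses.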

Now that this is done, you probably know how to do the same for the 10 most recent years…

7

Calculate the mean value of GDP across all countries for each of the last ten years.
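
Instead of typing the years by hand, the window can also be defined relative to the latest year in the data. This is only one possible approach, sketched on made-up data with a two-year window; column names are assumptions:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: two countries, four years.
toy_long <- tibble(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2000:2003, times = 2),
  gdp     = c(10, 12, 14, 16, 20, 22, 24, 26)
)

recent_means <- toy_long %>%
  filter(year > max(year) - 2) %>%   # keep the last 2 years (use 10 for the exercise)
  group_by(year) %>%
  summarise(mean_gdp = mean(gdp))

recent_means
```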